Things covered
Bandits of the k-Armed Bandit problem (test-bed)

import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from datetime import datetime
This is a simple way of letting the agent balance Exploration vs Exploitation.
Exploitation: we use the agent's current knowledge (its Q-value estimates) to choose the next action.
Exploration: we let the agent choose an action at random, so that it can learn about the arms it has not tried enough.
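The explore/exploit coin flip described above can be sketched as a small helper (this is a hypothetical function for illustration only; the execute() function later in the notebook inlines the same logic):

```python
import numpy as np

def choose_action(q_values, eps, rng=np.random.default_rng(0)):
    """Epsilon-greedy choice: explore with probability eps, else exploit."""
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))  # Explore: random arm
    return int(np.argmax(q_values))              # Exploit: best-known arm

# With eps=0 the greedy arm (highest estimate) is always chosen:
print(choose_action([0.1, 0.9, 0.3], eps=0.0))  # -> 1
```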
We provide a coldness parameter (simply the inverse of the temperature parameter usually used in ML/AI), which defines how quickly our negative exponential function decays. We plot 3 different epsilon decay (coldness) factors over 2000 episodes.
def epsilon(episode, coldness, ep_min=0, ep_max=1):
    return ep_min + (ep_max - ep_min) * np.exp(-coldness * episode)
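A quick sanity check on this schedule (using the default ep_min=0, ep_max=1): epsilon starts at ep_max and halves every ln(2)/coldness episodes.

```python
import numpy as np

def epsilon(episode, coldness, ep_min=0, ep_max=1):
    return ep_min + (ep_max - ep_min) * np.exp(-coldness * episode)

print(round(epsilon(0, 0.01), 6))           # starts at ep_max = 1.0
half_life = np.log(2) / 0.01                # ~69.3 episodes for coldness 0.01
print(round(epsilon(half_life, 0.01), 6))   # decayed to 0.5
```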
fig = make_subplots(rows=1, cols=3)
fig.add_trace(go.Scatter(x=list(range(2000)),
                         y=[epsilon(i, 0.001) for i in range(2000)], name="ep=0.001"),
              row=1, col=1)
fig.add_trace(go.Scatter(x=list(range(2000)),
                         y=[epsilon(i, 0.01) for i in range(2000)], name="ep=0.01"),
              row=1, col=2)
fig.add_trace(go.Scatter(x=list(range(2000)),
                         y=[epsilon(i, 0.1) for i in range(2000)], name="ep=0.1"),
              row=1, col=3)
fig.update_layout(title_text="Epsilon Decay for 2000 episodes")
fig.show()
The K-Armed Bandit test bed has the following features (note: for simplicity, we use k=10 throughout this notebook):
- The true mean reward of each bandit is drawn from a normal distribution with mean = 0 and sigma = 1.
- The rewards of each bandit are then drawn from a normal distribution with mean = the value drawn above and sigma = 1. This can be seen from the box plots presented below.

def bandit_mean_distribution(K):
    k_bandit_means = np.random.normal(0.0, 1.0, K)
    return k_bandit_means

k_means = bandit_mean_distribution(10)
bandit_data = list()
for idx, m in enumerate(k_means):
    bandit_data.append(np.random.normal(m, 1.0, 1000).tolist())
fig = go.Figure()
for i in range(10):
    fig.add_trace(go.Box(y=bandit_data[i], name="bandit_" + str(i)))
fig.show()
def get_new_q_dict():
    # Fresh Q-table: for each of the 10 arms, the running value estimate
    # and the number of times the arm has been pulled.
    q_d = dict()
    for k in range(10):
        q_d[k] = dict()
        q_d[k]["q_value"] = 0
        q_d[k]["n_exec"] = 0
    return q_d
def execute(n_episodes, n_steps, decay, update_func):
    ep_explore = list()
    ep_exploit = list()
    ep_avg_reward = list()
    q_dict = get_new_q_dict()
    for ep in range(n_episodes):
        if (ep + 1) % 500 == 0:
            print("Done with %d episodes" % (ep + 1))
        explore = 0
        exploit = 0
        reward = 0
        for t in range(n_steps):
            r = np.random.random()
            eps = epsilon(ep, decay)
            if r >= eps:
                # Exploit: pick the arm with the highest current estimate #
                exploit += 1
                action = np.argmax([q_dict[k]["q_value"] for k in range(10)])
            else:
                # Explore: pick a random arm #
                explore += 1
                action = np.random.choice(range(10))
            current_reward = np.random.choice(bandit_data[action])
            q_dict[action]["n_exec"] += 1
            q_dict[action]["q_value"] = update_func(q_dict[action]["q_value"],
                                                    current_reward,
                                                    q_dict[action]["n_exec"])
            reward += current_reward
        ep_explore.append(explore)
        ep_exploit.append(exploit)
        ep_avg_reward.append(reward / n_steps)
    return q_dict, ep_explore, ep_exploit, ep_avg_reward
def reward_graph(smallest, bigger, biggest, n_episodes):
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=list(range(n_episodes)), y=smallest, name="ep=0.001"))
    fig.add_trace(go.Scatter(x=list(range(n_episodes)), y=bigger, name="ep=0.01"))
    fig.add_trace(go.Scatter(x=list(range(n_episodes)), y=biggest, name="ep=0.1"))
    fig.show()
Here we update the Q(a) values incrementally, based on the formula below:
NewEstimate = OldEstimate + StepSize * [Target - OldEstimate]
which translates into the mathematical formulation
Q_{n+1} = Q_n + (1/n)[R_n - Q_n]
def update_q_incremental(cur_value, cur_reward, n_exec):
    # This is all we need to provide to the execute function
    # to update the q_values in the dict.
    return cur_value + (1 / n_exec) * (cur_reward - cur_value)
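As a quick check, the incremental rule should reproduce the plain running average of the rewards seen so far, without storing them all (a small verification sketch):

```python
import numpy as np

def update_q_incremental(cur_value, cur_reward, n_exec):
    return cur_value + (1 / n_exec) * (cur_reward - cur_value)

rewards = [1.0, 3.0, 2.0, 6.0]
q = 0.0
for n, r in enumerate(rewards, start=1):
    q = update_q_incremental(q, r, n)

print(q, np.mean(rewards))  # both 3.0
```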
Performing evaluation with n_episodes = 2000, n_steps = 1000, epsilon_decay = 0.001
s_time = datetime.now()
inc_q_dict_smallest, \
inc_ep_explore_smallest, \
inc_ep_exploit_smallest, \
inc_ep_avg_reward_smallest = execute(2000,1000,0.001,update_q_incremental)
print("Total elapsed time : %s" %((datetime.now()-s_time).total_seconds()))
Performing evaluation with n_episodes = 2000, n_steps = 1000, epsilon_decay = 0.01
s_time = datetime.now()
inc_q_dict_bigger, \
inc_ep_explore_bigger, \
inc_ep_exploit_bigger, \
inc_ep_avg_reward_bigger = execute(2000,1000,0.01,update_q_incremental)
print("Total elapsed time : %s" %((datetime.now()-s_time).total_seconds()))
Performing evaluation with n_episodes = 2000, n_steps = 1000, epsilon_decay = 0.1
s_time = datetime.now()
inc_q_dict_biggest, \
inc_ep_explore_biggest, \
inc_ep_exploit_biggest, \
inc_ep_avg_reward_biggest = execute(2000,1000,0.1,update_q_incremental)
print("Total elapsed time : %s" %((datetime.now()-s_time).total_seconds()))
reward_graph(inc_ep_avg_reward_smallest,inc_ep_avg_reward_bigger,inc_ep_avg_reward_biggest,2000)
IMPORTANT NOTE
Unlike what is given in the book, remember that as we increase the epsilon_decay factor, we get LESS EXPLORATION and MORE EXPLOITATION, and vice versa. Therefore, as we decrease the value of epsilon_decay, the overall average rewards generally improve, although too little decay hurts as well:
- ep = 0.1: the decay is very strong, so very little exploration happens, and hence the overall rewards are lowest (green line).
- ep = 0.01: the decay is lower, so considerable exploration happens, and we get higher rewards, in fact the best (red line).
- ep = 0.001: the decay is very low, so the agent explores more than required and exploits less, and the rewards do not get maximized in the long run.
Therefore, it is important to explore, but it is also important for the agent to exploit in the later episodes.
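The effect can be quantified: since the probability of exploring at episode ep is epsilon(ep, decay), summing it over all episodes gives the expected number of exploring episodes, which shows how quickly a strong decay shuts exploration down (a sketch reusing the epsilon() defined above, with the default ep_min=0, ep_max=1):

```python
import numpy as np

def epsilon(episode, coldness, ep_min=0, ep_max=1):
    return ep_min + (ep_max - ep_min) * np.exp(-coldness * episode)

# Expected number of exploring episodes over 2000 episodes, per decay factor.
for decay in (0.001, 0.01, 0.1):
    expected_explore = sum(epsilon(ep, decay) for ep in range(2000))
    print(decay, round(expected_explore, 1))
```

The totals drop by roughly an order of magnitude for each tenfold increase in the decay factor, matching the ordering of the three curves described above.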
fig = make_subplots(rows=2, cols=1,
                    specs=[[{"secondary_y": True}],
                           [{"secondary_y": True}]])
fig.add_trace(
    go.Bar(x=list(range(10)), y=[inc_q_dict_smallest[k]["n_exec"] for k in range(10)], name="n_exec"),
    secondary_y=False,
    row=1, col=1)
fig.add_trace(
    go.Scatter(x=list(range(10)), y=[inc_q_dict_smallest[k]["q_value"] for k in range(10)], name="q_value"),
    secondary_y=True,
    row=1, col=1)
for i in range(10):
    fig.add_trace(go.Box(y=bandit_data[i], name="bandit_" + str(i)), row=2, col=1, secondary_y=False)
fig.update_layout(height=600, width=800)
fig.show()
This is indeed a remarkable inference: without any ability to look at the reward distribution of each action directly, the agent has gathered that information purely through experimentation.
inc_q_dict_smallest # for epsilon = 0.001
inc_q_dict_bigger # for epsilon = 0.01
inc_q_dict_biggest # for epsilon = 0.1
- ep = 0.001: the q_values are spread out, with no clear distinction of which action is best. This is because of the excessive exploration done by the agent.
- ep = 0.01: the q_values favor Action 0, which has the highest mean reward, so this seems to be the best epsilon decay factor.
- ep = 0.1: the agent took Action 1 the most, even though it had a lower q_value, because of the lack of randomness. This is clearly not a good epsilon decay strategy.